Efficient Evaluation of Nondeterministic Automata Using Factorization Forests
Abstract
In the first part of the paper, we propose an algorithm which takes as input an NFA A and a word a_1 ⋯ a_n, does a precomputation, and then answers queries of the form: "is the infix a_i ⋯ a_j accepted by A?". The precomputation is in time poly(A) · n, and the queries are answered in time poly(A). This improves on previous algorithms that worked with the exponentially less succinct DFAs or monoids. In the second part of the paper, we propose a transducer model for data trees. We show that the transducer can be evaluated in linear time. We use this result to evaluate XPath queries in linear time. The algorithms in both parts of the paper use factorization forests.

This paper develops the use of factorization forests [8] for efficient evaluation of automata. The paper has two parts. The first part, which builds on [4, 2], uses factorization forests to evaluate automata on arbitrary infixes of a word in constant time, after a linear time precomputation. The second part, which builds on [3, 7], uses factorization forests to efficiently evaluate XPath queries.

Infix evaluation. The first part of the paper studies the following problem, which is parametrized by a regular language L ⊆ A*. For a word a_1 ⋯ a_n ∈ A*, we want to build a data structure. Then, we want to use the data structure to quickly answer queries of the form: given two positions i ≤ j in {1, . . . , n}, answer whether the infix a_i ⋯ a_j belongs to L. We call this the infix evaluation problem. A solution of the problem consists of two algorithms: the preprocessing, which takes a_1 ⋯ a_n as input and builds the data structure, and the query answering, which takes i ≤ j as input and answers whether a_i ⋯ a_j ∈ L.

A natural solution uses a divide and conquer approach. Suppose that L is recognized by a nondeterministic automaton with states Q. The preprocessing splits the word into halves, quarters, and so on.
Each such infix is decorated with the set of state pairs that describe possible runs of the automaton over the infix. The preprocessing is in time poly(Q) · n, while the query answering is in time poly(Q) · log(n).

As observed by Thomas Colcombet in [4], a beautiful result of Imre Simon, called the Factorization Forest Theorem [8], can be used to answer the queries in time independent of the word's length. The data structure uses an algebraic approach to regular languages, where a language is recognized by a homomorphism from A* into a finite monoid M. The preprocessing is in time linear in |M| · n, and the query answering is in time linear in |M|.

(Acknowledgement: We acknowledge the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under the FET-Open grant agreement FOX, number FP7ICT-233599. Work supported by Polish government grant no. N N206 380037.)

What if the language L is given by an automaton and not a homomorphism? We can always compile the automaton to a monoid and use the above result. From the point of view of the length n of the word, the preprocessing is in linear time, and the query answering is in constant time. However, compiling even a deterministic automaton into a monoid can yield an exponential blowup. This gives big constants in the linear and constant times.

We can do better. If the language L is given by a deterministic automaton with states Q, a fairly straightforward structure, called the tape construction in [2], can be used to solve the problem with preprocessing in time poly(Q) · n and query answering in time poly(Q). In this paper, we improve the results from [4] and [2]: we give an algorithm that works with nondeterministic automata. As with the tape construction, the preprocessing is in time poly(Q) · n and the query answering in time poly(Q).
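For comparison, the divide-and-conquer baseline described earlier (preprocessing in time poly(Q) · n, queries in time poly(Q) · log(n)) can be sketched as a segment tree whose nodes store sets of state pairs. This is only an illustration of that baseline, not the factorization forest algorithm of the paper. The names `compose`, `InfixIndex` and `accepts_infix`, the encoding of transitions as a dictionary from letters to sets of state pairs, and the example automaton are all assumptions made for this sketch.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

# A relation is a set of state pairs (p, q): "some run leads from p to q".
Relation = FrozenSet[Tuple[int, int]]


def compose(r: Relation, s: Relation) -> Relation:
    """Pairs (p, q) such that r leads from p to some state m and s leads from m to q."""
    successors: Dict[int, Set[int]] = {}
    for m, q in s:
        successors.setdefault(m, set()).add(q)
    return frozenset((p, q) for p, m in r for q in successors.get(m, ()))


class InfixIndex:
    """Hypothetical helper: segment tree of state-pair relations over a word."""

    def __init__(self, num_states: int,
                 delta: Dict[str, Iterable[Tuple[int, int]]],
                 initial: Iterable[int], final: Iterable[int],
                 word: str) -> None:
        self.initial, self.final = set(initial), set(final)
        # The identity relation is neutral for compose and pads unused leaves.
        self.identity: Relation = frozenset((q, q) for q in range(num_states))
        self.size = 1
        while self.size < max(1, len(word)):
            self.size *= 2
        self.tree: List[Relation] = [self.identity] * (2 * self.size)
        for k, letter in enumerate(word):
            self.tree[self.size + k] = frozenset(delta.get(letter, ()))
        # Each inner node covers a half, a quarter, ... of the word.
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = compose(self.tree[2 * node], self.tree[2 * node + 1])

    def accepts_infix(self, i: int, j: int) -> bool:
        """Is a_i ... a_j accepted? Positions are 1-based and inclusive, i <= j."""
        left, right = self.identity, self.identity
        lo, hi = i - 1 + self.size, j + self.size  # half-open leaf range [lo, hi)
        while lo < hi:
            if lo & 1:  # lo is a right child: take it and move right
                left = compose(left, self.tree[lo])
                lo += 1
            if hi & 1:  # hi - 1 is a left child: take it and move left
                hi -= 1
                right = compose(self.tree[hi], right)
            lo //= 2
            hi //= 2
        relation = compose(left, right)
        return any(p in self.initial and q in self.final for p, q in relation)


# Example NFA (an assumption): words over {a, b} whose second-to-last letter is 'a'.
DELTA = {"a": {(0, 0), (0, 1), (1, 2)}, "b": {(0, 0), (1, 2)}}
index = InfixIndex(3, DELTA, initial={0}, final={2}, word="abbab")
print(index.accepts_infix(1, 2))  # infix "ab" -> True
print(index.accepts_infix(2, 3))  # infix "bb" -> False
```

Each query composes O(log n) stored relations, and each composition costs time polynomial in the number of states, which is where the poly(Q) · log(n) bound comes from. Removing the log(n) factor is what the factorization forest approach is for.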
The new algorithm does not use the tape construction, which does not seem to generalize from deterministic to nondeterministic automata. Instead, it builds on factorization forests.

XPath evaluation. The second part of the paper is about XPath evaluation. The input for an XPath query is an XML document, which we model as a data tree. A data tree is a tree where each node carries two pieces of information: a tag name or label from a finite alphabet A, as well as a data value from an infinite alphabet D (such as integers, or Unicode strings). An XPath query says "yes" or "no" to each node in a data tree. The XPath evaluation problem is to find the nodes to which the query says "yes".

There are algorithms which can solve this problem in time polynomial in the size of the query φ and the number of nodes n in the data tree, see [1] for a survey. However, with large XML documents (e.g. dblp.xml is currently 674 megabytes and has millions of nodes), an algorithm that is quadratic in n is impractical.

In previous work [3, 7], we have developed algorithms which are linear in n. The first algorithm, from [3], runs in time exp(φ) · n. The reason for the exponential complexity in the query is that parts of the query are represented by monoids. The algorithm works for an extension of XPath, called Regular XPath, which allows Kleene star in programs. The second algorithm, from [7], runs in time poly(φ) · n. It works for XPath without the Kleene star. The general idea is that monoids can be avoided without the Kleene star. Both algorithms, especially the first one, use the ideas developed in the infix evaluation problem that is studied in the first part of this paper.

In the second part of this paper, we propose a new approach to XPath evaluation. We introduce an automaton model, which acts as an intermediate step between XPath and the evaluation algorithm. The automaton model is a type of transducer, which we call a data aggregate transducer.
Given an input data tree, a data aggregate transducer produces new labels for the nodes, and does